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KRIGING METAMODELS AND EXPERIMENTAL DESIGN FOR 
BERMUDAN OPTION PRICING 

MIKE LUDKOVSKI 


Abstract. We investigate two new strategies for the numerical solution of optimal stopping prob¬ 
lems within the Regression Monte Carlo (RMC) framework of Longstaff and Schwartz. First, we 
propose the nse of stochastic kriging (Gaussian process) meta-models for fitting the continuation 
value. Kriging offers a flexible, nonparametric regression approach that quantifies approximation 
quality. Second, we connect the choice of stochastic grids used in RMC to the Design of Experi¬ 
ments paradigm. We examine space-filling and adaptive experimental designs; we also investigate 
the use of batching with replicated simulations at design sites to improve the signal-to-noise ratio. 
Numerical case studies for valuing Bermudan Puts and Max-Calls under a variety of asset dy¬ 
namics illustrate that our methods offer significant reduction in simulation budgets over existing 
approaches. 


1. Introduction 

The problem of efficient valuation and optimal exercise of American/Bermudan options remains 
one of intense interest in computational finance. The underlying timing flexibility, which is math¬ 
ematically mapped to an optimal stopping objective, is ubiquitous in financial contexts, showing 
up both directly in various derivatives and as a building block in more complicated contracts, 
such as multiple-exercise opportunities or sequential decision-making. Since timing optionality 
is often embedded on top of other features, ideally one would have access to a flexible pricing 
architecture to easily import optimal stopping sub-routines. While there are multiple alternatives 
that have been proposed and investigated, most existing methods are not fully scalable across 
the range of applications, keeping the holy grail of such “black-box” optimal stopping algorithm 
out of reachh This is either due to reliance on approaches that work only for a limited subset of 
models, (e.g. one-dimensional integral equations, various semi-analytic formulas, etc.) or due to 
severe curses of dimensionality that cause rapid deterioration in performance as model complexity 
increases. 

Within this landscape, the regression Monte Carlo framework or RMC (often called Least 
Squares Monte Carlo, though see our discussion on this terminology in Section 2.2) has emerged 
as perhaps the most popular generic method to tackle optimal stopping. The intrinsic scalability 
of Monte Carlo implies that if the problem is more complex one simply runs more simulations, 
with the underlying implementation essentially independent of dimensionality, model dynamics, 
etc. However, the comparatively slow convergence rate of RMC places emphasis on obtaining 
more accurate results with fewer simulated paths, spurring an active area of research, see e.g. [1, 
4, 14, 20, 21, 29, 41]. 

In this article, we propose a novel marriage of modern statistical frameworks and the optimal 
stopping problem, making two methodological contributions. First, we examine the use of kriging 
meta-models as part of a simulation-based routine to approximate the optimal stopping policies. 


Work Partially Supported by NSE CDSE-1521743. 

^By black-box we mean a “smart” framework that automatically adjusts low-level algorithm parameters based 
on the problem setting. It might still require some high-level user input, but avoids extensive fine-tuning or, at the 
other extreme, a hard-coded, non-adaptive method. 
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Kriging offers a flexible non-parametric method (with several additional convenient features dis¬ 
cussed below) for approximating the continuation value, and as we demonstrate is competitive 
with existing basis-expansion setups. To our knowledge this is the first article to use kriging models 
in optimal stopping. Second, we investigate the experimental design aspect of RMC. We propose 
several alternatives on how to generate and customize the respective stochastic grids, drawing 
attention to the accompanying performance gains. Among others, we investigate adaptive gener¬ 
ation of the simulation designs, extending the ideas in Gramacy and Ludkovski [16] who originally 
proposed the use of sequential design for optimal stopping. Rather than purely targeting speed 
savings, our strategies explore opportunities to reduce the simulation budget of RMC through 
“smart” regression and design approaches. In particular, the combination of kriging and im¬ 
proved experimental design brings up to an order-of-magnitude savings in simulation costs, which 
can be roughly equated to memory requirements. While there is considerable overhead added, 
these gains indicate the potential of these techniques to optimize the speed-memory trade-off. 
They also show the benefits of approaching RMC as an iterated meta-modeling/response-surface 
modeling problem. 

Our framework is an outgrowth of the recent work by the author in [16]. In relation to the 
latter paper, the present article makes several modifications that are attractive for the derivative 
valuation context. First, while [16] suggested the use of piecewise-defined Dynamic Trees regres¬ 
sion, the kriging meta-models are intrinsically continuous. As such, they are better suited for the 
typical financial application where the value function is smooth. Second, to reduce computational 
overhead, we introduce a new strategy of batching by generating multiple scenarios for a given 
initial condition. Third, in contrast to [16] that focused on sequential designs, we also provide 
a detailed examination and comparison of various static experimental designs. Our experiments 
indicate that this already achieves significant simulation savings with a much smaller overhead, 
and is one of the “sweet spots” in terms of overall performance. 

The rest of the paper is organized as follows. Section 2 sets the notation we use for a generic 
optimal stopping problem, and the RMC paradigm for its numerical approximation. Along the 
way we recast RMC as a sequence of meta-modeling tasks. Section 3 introduces and discusses 
stochastic kriging meta-models that we propose to use for empirical fitting of the continuation 
value. Section 4 then switches to the second main thrust of this article, namely the issue of 
experimental designs for RMC. We investigate space-filling designs, as well as the idea of batching 
or replication. Section 5 marries the kriging methodology with the framework of sequential design 
to obtain an efficient approach for generating designs adapted to the loss criterion of RMC. In 
Section 6 we present a variety of case studies, including a classical bivariate CBM model, several 
stochastic volatility setups, and multivariate CBM models with multiple stopping regions. Finally, 
Section 7 summarizes our proposals and findings. 


2. Model 

We consider a discrete-time optimal stopping problem on a finite horizon. Let (W), t = 
0,1,...,T be a Markov state process on a stochastic basis (D,Too,IF’) taking values in some 
(usually uncountable) subset A C The financial contract in question has a maturity date of 
T < oo and can be exercised at any earlier date with payoff h{t, x) G M. Note that in the financial 
applications this corresponds to a Bermudan contract with a unit of time corresponding to the 
underlying discretization At = 1 of early exercise opportunities. The dependence of h{-,x) on t 
is to incorporate interest rate discounting. 

Let Tt = a{Xi:t), be the information filtration generated by (Xt) and S the collection of 
all (J-t)-stopping times bounded by T. The Bermudan contract valuation problem consists of 
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maximizing the expected reward h{T, X^) over all t £ S. More precisely, define for any 0 < t < T, 

(2.1) V{t,x):= sup Et,x[h{T,Xr)], 

T>t,rG(S 

where = x] denotes P-expectation given initial condition x. We assume that h{t, •) 

is such that V{t, x) < oo for any x, for instance bounded. 

Using the tower property of conditional expectations and one-step analysis, 

V{t,x)= sup E[h{T, XT-)\Xt = x] 

T>t,rG(S 

(2.2) = max {h{t, x), C{t + 1, W+i)) = h{t, x) + max(r(t, x), 0), 
where we defined the continuation value C{t,x) and timing-value T{t,x) via 

(2.3) C{t,x):=Et,x[V{t + l,Xt+i)]-, 

(2.4) T{t, x) := C{t, x) — h{t, x). 

Since V{t,x) > h{t,x) for all t and x, the smallest optimal stopping time T*{t,x) for (2.1) satisfies 

(2.5) {T*{t,x) = t} = {h{t,x) > C{t,x)} = {T{t,x) < 0}. 

Thus, it is optimal to stop immediately if and only if the conditional expectation of tomorrow’s 
reward-to-go is less than the immediate reward. Rephrasing, figuring out the decision to stop at 
t is equivalent to finding the zero level-set (or contour) of the timing value T{t,x) or classifying 
the state space X 3 Xt into the stopping region &t ;= {x : T(t,x) < 0} and its complement the 
continuation region. By induction, an optimal stopping time is 

(2.6) T*{t, x) = inf{s > t : Xs £ ©*} A T. 

From the financial perspective, obtaining the exercise strategy as defined by t* is often even more 
important than finding the present value of the contract. 

The representation (2.6) suggests a recursive procedure to estimate the optimal policy as defined 
by T*. To wit, given a collection Gt+i-.r of subsets of X which are stand-ins for the stopping 
regions, define the corresponding exercise strategy pathwise for a scenario ui: 

(2.7) f{t, x){uj) := inf{s > t : Xs{uj) £ S^} A T. 

Equipped with f, we obtain the pathwise payoff 

(2.8) Ht{X^-&t+i:T)){uj) := /i(f(t,x)(a;),X,(*,,)(,)(a;)). 

Taking expectations, Et^x[Ht{X^-,&t+i:T)\ is the expected payoff starting at x, not exercising 
immediately at step t and then following the exercise strategy specified by our notation 

highlights the latter (path-) dependence of Ht on future stopping regions &ts. If &s = &s for 
all s > t, we can use (2.8) to recover V{t,x). Indeed, the expected value of Ht in this case is 
precisely the continuation value Et^x[Ht{X^-,&t+i-.Ty\ = C{t,x). Consequently, finding C{t,x) 
(and hence the value function via (2.2)) is reduced to computing conditional expectations of the 
form X I—7- Et^x[Ht{Xi^)]. Because analytic evaluation of this map is typically not tractable, this 
is the starting point for Monte Carlo approximations which only require the ability to simulate 
(W)-ti'ajectories. In particular, simulating N independent scenarios x^.^, n = 1,... ,N emanating 
from Xt=x yields the estimator V{t,x) via 

^ 1 ^ 

(2.9) C'(t,x) := — ^F7t(x"-), V{t,x) := m.ax{h{t,x),C{t,x)). 

n=l 

A key observation is that the approximation quality of (2.9) is driven by the accuracy of the 
stopping sets &t+i-T that determine the exercise strategy and control the resulting errors (formally 
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we should thus use a double-hat notation to contrast the estimate in (2.9) with the true expectation 
C*{t,x) = Thus, for the sub-problem of approximating the conditional 

expectation of Ht{X.), there is a thresholding property, so that regression errors are tolerated 
as long as they do not affect the ordering of "&t,x[Ht{X.)\ and h{t,x). This turns out to be a 
major advantage compared to value function approximation techniques as in [42], which use (2.2) 
directly and do not explicitly make use of pathwise payoffs. See also [40] for a recent comparison 
of value-function and stopping-time approximations. 

2.1. Meta-Modeling. In (2.9) the estimate of Kt^x[Ht{X^)] was obtained pointwise for a given 
location (t, x) through a plain Monte Carlo approach. Alternatively, given paths n = 1,... ,N 
emanating from different initial conditions x”, one may borrow information eross-sectionally by 
employing a statistical regression framework. A regression model specifies the link 

(2.10) Ht{X^) = C{t,x) + e{t,x) 

where e{t, x) is the mean-zero noise term with variance cT^(t, x). This noise arises from the random 
component of the pathwise payoff due to the internal fluctuations in Xt:T and hence f(t, x). Letting 
= Ht{x^'-^) be a vector of obtained pathwise payoffs, one regresses against xl’^ to fit 
an approximation Cif, •). Evaluating C{t., •) at any x G A then yields the predicted continuation 
value from the exercise strategy conditional on Xt = x. 

The advantage of regression is that it replaces pointwise estimation (which requires re-running 
additional scenarios for each location x) with a single step that fuses all information contained in 
{xt,yt)^'^ to fit C{t,-). Armed with the estimated we take the natural estimator of the 

stopping region at t, 

(2.11) Gt := {x : C{t,x) < h{t,x)}, 

which yields an inductive procedure to approximate the full exercise strategy via (2.7). Note that 

(2.11) is a double approximation of Gt - due to the discrepancy between C{t,-) and the true 
expectation Kt^x[Ht{X^; St+i:^)] from (2.8), as well as due to partially propagating the previous 
errors in Gt+i:T relative to the true 

The problem of obtaining C{t, ■) based on (2.10) is known as meta-modeling or emulation in the 
machine learning and simulation optimization literatures [25, 3, 38]. It consists of two fundamental 
steps: first a design (i.e. a grid) V := x^'^ is constructed and the corresponding pathwise payoffs 
y^'^ are realized via simulation. Then, a regression model (2.10) is trained to link the y’s to x’s 
via an estimated C{t,-). In the context of (2.3), emulation can be traced to the seminal works 
of Tsitsiklis and van Roy [42] and Longstaff and Schwartz [30] (although the idea of applying 
regression originated earlier, with at least Carriere [8]). The above references pioneered classical 
least-squares regression (known by the acronyms Least Squares Method, Least Squares Monte 
Carlo and more generally as Regression Monte Carlo and Regression Based Schemes) for (2.10), 
i.e. the use of projection onto a finite family of basis functions {Rr(a;)}^=i, 

R 

C{t,x) = PrBrjx). 

r=l 

The projection was carried out by minimizing the distance between C{t,-) and the manifold 
T-Lr := span(Rr) spanned by the basis functions (akin to the definition of conditional expectation): 

(2.12) Cit, •) = arg infL 2 (/), 

where the loss function L 2 is a weighted norm based on the distribution of Xt, 

(2.13) L2(/) = \\Ht - = Eo,Xo [{Ht{X.) - f{Xt)f] . 
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Motivated by the weights in (2.13), the accompanying design is commonly generated by 

1.1. d. sampling from p{t, -lO, Xq). As suggested in the original articles [30, 42] this can be efficiently 

implemented by a single global simulation of A-paths although at the cost of introducing 

further t-dependencies among C{t, •)’s. 

Returning to the more abstract view, least-squares regression (2.12) is an instance of a meta¬ 
modeling technique. Moreover, it obscures the twin issue of generating D = . Indeed, in 

contrast to standard statistical setups, in meta-modeling there is no a priori data per se; instead 
the solver is responsible both for generating the data and for training the model. These two tasks 
are intrinsically intertwined. The subject of Design of Experiments (DoE) has developed around 
this issue, but has been largely absent in RMC approaches. In the next Section we re-examine 
existing RMC methods and propose novel RMC algorithms that build on the DoE/meta-modeling 
paradigm by targeting efficient designs V. 

2.2. Closer Look at the RMC Regression Problem. Eigure 1 shows the distribution of 

in a 1-D Geometric Brownian motion Bermudan Put option problem (see Section 4.4), 
the archetype of a financial application. The left plot shows a scatterplot of (x, Ht(X^) — h{t, x)) 
as X G M+ varies, and the right panel gives a histogram of Ht[Xffi for a fixed initial value Xt = x. 
Two features become apparent from these plots. First, the noise variance a'^{t,x) is extremely 
large, swamping the actual shape of x i— T{t,x). Second, the noise distribution e(f,x) is highly 
non-standard. It involves a potent brew of (i) heavy-tails; (ii) significant skewness; (iii) multi¬ 
modality. In fact, e(f, x) does not even have a smooth density as the nonnegativity of the payoff 
h(t, •) > 0, implies that Ht{X.) has a point mass at zero. 



X 



Figure I. Pathwise payoffs in a I-D Bermudan Put problem at t = 0.6. The 
underlying problem setting is described in Section 4.4. Left: scatterplot of 
{x,Ht{X^) — h{t,x)) over 200 distinct x G M+ (generated as a Sobol sequence). 
Right: Histogram of N = 200 pathwise future payoffs y ~ Ht[X^) starting at 
X = 35. The vertical dashed line indicates the empirical mean K[Ht{X^)\Xt = 
35] ~ Ave{y^''^) = 5.49. In 24 out of 200 scenarios the payoff was zero, creat¬ 
ing a point mass in the distribution of Ht{X^). Other summary statistics were 
StDev{y^'^) = 2.45, Skew{y^'^) = —1.28 and Max{y^'^) = 9.87. 


Moreover, as can be observed in the left panel of Figure I, the distribution of e(f, •) is het- 
eroscedastic, with a state-dependent conditional skew and state-dependent point mass at zero. 
These phenomena violate the assumptions of ordinary least squares regression, making the sam¬ 
pling distribution of fitted ffi ill-behaved, i.e. far from Gaussian. More robust versions of (2.12), 
including regularized (such as Lasso) or localized (such as Loess or piecewise linear models) least 
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squares frameworks have therefore been proposed, see [27, 28]. Further structured regression pro¬ 
posals can be found in [24, 29]. Alternatively, a range of variance reduction methods, such as 
control variates, have been suggested to ameliorate the estimation of /3 in (2.12), see for example 
[1, 4, 20, 21, 23]. 

The parametric constraints placed by the LSMC approach (2.12) on the shape of the continu¬ 
ation value C G 77make its performance highly sensitive to the distance between the manifold 
77 _r and the true C{t, •) [39]. Moreover, when using (2.12) a delicate balance must be observed 
between the number of scenarios N and the number of basis functions R. These concerns high¬ 
light the inflexible nature of parametric regression and become extremely challenging in higher 
dimensional settings with unknown geometry of C{t, •). One work-around is to employ hierarchi¬ 
cal representations, for example by partitioning X into disjoint bins and employing independent 
linear hyperplane fits in each bin, see e.g. Bouchard and Warin [6]. Nevertheless, there remain 
ongoing challenges in adaptive construction of Rr. 

Another limitation of the standard LSMC approach is its objective function. As discussed, 
least squares regression can be interpreted as minimizing a global weighted mean-squared-error 
loss function. However, as shown in (2.5), the principal aim in optimal stopping is not to learn 
C{t,-) globally in the sense of its mean-squared-error, but to rank it vis-a-vis Indeed, 

recalling the definition of the timing function, the correct loss function Lrmc (cf. [16]) is 

(2.14) Ljimc{C)=Kq Xo ^i)|l{signT(t,Xt) 7 ^signT(t,Xt)} ’ T{t, x) = C{t, x) — h{t, x). 

Consequently, the loss criterion is localized around the contour d&t = {C{t,x) = h{t,x)} = 
{T{t,x) = 0}. Indeed, a good intuition is that optimal stopping is effectively a classification 
problem; in some regions of X, the sign of T{t,x) is easy to detect and the stopping decision is 
“obvious”, while in other regions T{t,x) is close to zero and resolving the optimal decision re¬ 
quires more statistical discrimination. For example, in Bermudan options it is a priori clear that 
out-of-the-money (OTM) C{t,x) > h{t,x), and so it is optimal to continue at such {t,x). This 
observation was already stated in [30] who suggested to only regress in-the-money (ITM) trajec¬ 
tories. However, it also belies the apparent inefficiency in the corresponding design— the widely 
used LSMC implementation first “blindly” generates a design that covers the full domain of Xt, 
only to discard all OTM scenarios. This dramatically shrinks the effective design size, up to 80% 
in the case of an OTM option. The mismatch between (2.14) and (2.13) makes the terminology 
of “least-squares” Monte Carlo that exclusively emphasizes the L^-criterion unfortunate. 

While one could directly use (2.14) during the cross-sectional regression (for example to es¬ 
timate the coefficients [3 in (2.12), which however would require implementing an entirely new 
optimization problem), a much more effective approach is to take this criterion into account dur¬ 
ing the experimental design. Intuitively, the local accuracy of any approximation of C{t,x) is 
limited by the amount of observations in the neighborhood, i.e. the density of T> around x. The 
form of (2.14) therefore suggests that the ideal approach would be to place V along the stop¬ 
ping boundary d&t in analogue to importance sampling. Of course the obvious challenge is that 
knowing &t is equivalent to solving the stopping problem. Nevertheless, this perspective suggests 
multiple improvements to the baseline approach which is entirely oblivious to (2.14) when picking 

2.3. Outline of Proposed Approach. Motivated by the above discussion, we seek efficient 
meta-models for learning C{t,-) that can exploit the local nature of (2.14). This leads us to 
consider non-parametric, kernel-based frameworks, specifically Gaussian process regression, also 
known as kriging [3, 25, 45]. We then explore a range of Design of Experiments approaches for 
constructing T>. Proposed design families include (i) uniform gridding; (ii) random and deter¬ 
ministic space-filling designs; (hi) adaptive sequential designs based on expected improvement 
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criteria. While only the last choice is truly tailored to (2.14), we explain that reasonably well- 
chosen versions of (i) and (ii) can already be a significant improvement on the baseline. 

To formalize construction of "D, we rephrase (2.10) as a response surface (aka meta-) modeling 
(RSM) problem. RSM is concerned with the task of estimating an unknown function x i—)• f{x) 
that is noisily observed through a stochastic sampler, 

(2.15) Y{x) f{x) + e{x), E[e^{x)] = a'^(x), 

where we remind the reader to substitute in their mind Y{x) = Ht{X^) and / = t is 

treated as a parameter. Two basic requirements in RSM are a flexible nonparametric approxi¬ 
mation architecture Ti that is used for searching for the outputted fit /, and global consistency, 
i.e. convergence to the ground truth as number of simulations N increases without bound. The 
experimental design problem then aims to maximize the information obtained from N samples 
{x,yY''^ towards the end of minimizing the specified loss function L{f). 

The pathwise realizations of involve simulation noise which must be smoothed out and 

as we saw is often highly non-Gaussian. To alleviate this issue, we mimic the plain Monte Carlo 
strategy in (2.9) that uses the classical Law of Large Numbers to obtain a local estimate of 
C{t,x), by constructing batched or replicated designs. Batching generates multiple independent 
samples ~ y(x), z = 1,..., M at the same x, which are used to compute the empirical average 
y ~ M y^^'^ ■ Clearly, Y follows the same statistical model (2.15) but with a signal-to-noise ratio 
improved by a factor of M, E[e^(x)] = a‘^{x)/M. This also reduces the post-averaged design size 
\T>\, which speeds-up and improves the training of the meta-model. 

To recap, kriging gives a flexible non-parametric representation for C{t, •). The fitting method 
automatically detects the spatial structure of the continuation value obviating the need to specify 
the approximation function-class. While the user must pick the kernel family, extensive experi¬ 
ments suggest that these have a second-order effect. Also, the probabilistic structure of kriging 
offers convenient goodness-of-fit diagnostics after C is obtained, yielding a tractable quantification 
of posterior uncertainty. This can be exploited for DoE, see Section 5. 

3. Kriging Metamodels 

Kriging models have emerged as perhaps the most popular framework for DoE over continuous 
input spaces X. In kriging the cross-sectional interpolation is driven by the spatial smoothness of 
C{t, •) and essentially consists of aggregating the observed values of Y in the neighborhood of x. 
Book-length treatment of kriging can be found in [45], see also [25, Ch. 5] and [3]. To be precise, 
in this article we deal with stochastic kriging under heterogeneous sampling noise. 

Stochastic kriging meta-models follow a Bayesian paradigm, treating / in (2.15) as a random 
object belonging to a function space T-l. Thus, for each x, /(x) is a random variable whose posterior 
distribution is obtained based on the collected information from samples (x, y)^'^ and its prior 
distribution Qq. Let Q be the information generated by the design V, i.e. Q = a{Y{x) : x G x^''^). 
We then define the posterior distribution M.x{-) of /(x); the global map x i—)■ M.x is called the 
surrogate surface (the terminology is a misnomer, since Xix is in fact measure-valued). Its first 
two moments are the surrogate mean and variance respectively, 

(3.1) m{x) :=E[f{x)\Q]= [ yMx{dy), 

Jr 

(3.2) u(x)^ := E[(/(x) - m(x))^|^] = [ {y - m{x)fjWx{dy). 

Jr 

Note that in this function-view framework, a point estimate / of the unknown / is replaced with 
a full posterior distribution E[/|^] over the space JJ; tractability is achieved by imposing and 
maintaining Gaussian structure. The surrogate mean surface x i—)• m{x) is then the natural proxy 
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(being the maximum a posteriori probability estimator) for /, and the surrogate variance u^(-) is 
the proxy for the standard error of m(-). 

To model the relationship between f{x) and f{x') at different locations x,x' G X kriging uses 
the Reproducing Kernel Hilbert Space (RKHS) approach, treating the full /(•) as a realization 
of a mean-zero Gaussian process. If necessary, / is first “de-trended” to justify the mean-zero 
assumption. The Gaussian process generating / is based on a covariance kernel JC : —)■ M, 

with /C(x,x') = E[/(x)/(x')|0o]- The resulting function class T-Lfc ;= span(/C(-,x'),x' G X) forms 
a Hilbert space. The RKHS structure implies that both the prior and posterior distributions of 
/(•) given Q are multivariate Gaussian. 

By specifying the correlation behavior, the kernel /C encodes the smoothness of the response 
surfaces drawn from the GP %]c which is measured in terms of the RKHS norm || • W'Uf^. Two 
main examples we use are the squared exponential kernel 

(3.3) /C(x, x'] s, 9) = exp (—||x — x'||0/2) , 

and the Matern-5/2 kernel 

(3.4) }C{x,x'; s,9) = s^(l -|- \/5||x — x\\e + ^||x — x'W'pj ■ II®, 

o 

which both use the weighted Euclidean norm ||x||g = x • (diag0) • x'^ = Yl'j=i The length- 

scale vector 9 controls the smoothness of members of T-Lic, the smaller the rougher. The variance 
parameter determines the amplitude of fluctuations in the response. In the limit 9 —)■ -|-oo, 
elements of T-Lic concentrate on the linear functions, effectively embedding linear regression; if 
s = 0, elements of T-Lic are constants. For both of the above cases, members of the function space 
Hic can uniformly approximate any continuous function on any compact subset of X. 

The use of different length-scales 9j for different coordinates of x allows anisotropic kernels 
that reflect varying smoothness of / in terms of its different input coordinates. For example, in 
Bermudan option valuation, continuation values would typically be more sensitive to the asset 
price than to a stochastic volatility factor which would be reflected in 9i < 02 - 

Let y = (y(x^),... ,y{x^))'^ denote the observed noisy samples at locations x = x^'^. Given 
the data {x,y)^''^ and the kernel /C, the posterior of / again forms a GP; in other words any 
collection (•),... is multivariate Gaussian with means m{x'j), covariances u(x',x'), 

and variances u^(x') = u(x',x'), specified by the matrix equations [45, Sec. 2.7]: 

(3.5) m(x) = k{x)'^(K + 'S)~^y; 

(3.6) v{x, x') = /C(x, x') — fe(x)'^(K -|- Ti)~^k{x), 

with k{x) = (/C(x^, x),..., )C{x^, x))^, S := diag(cj^(x^),..., a‘^{x^)) and K the N x N positive 
definite matrix Kjj := /C(x*,x-^), 1 < i, j < N. The diagonal structure of S is due to the 
assumption that independent simulations have uncorrelated noise, i.e. use independent random 
numbers. Note that the uncertainty, u^(x), associated with the prediction at x has no direct 
dependence on the simulation outputs y^'^', all response points contribute to the estimation of 
the local error through their influence on the induced correlation matrix K. Well-explored regions 
will have lower u^(x), while regions with few or no observations (in particular regions beyond the 
training design) will have high u^(x). 

3.1. Training a Kriging Metamodel. To fit a kriging metamodel requires specifying kLic, 
i.e. fixing the kernel family and then estimating the respective kernel hyper-parameters, such as 
s, 9 in (3.4) that are plugged into (3.5)-(3.6). The typical approach is to use a maximum likelihood 
approach which leads to solving a nonlinear optimization problem to find the MLEs s*, 9* defining 
JC. Alternatively, cross-validation techniques or EM-type algorithms are also available. 



While the choice of the kernel family is akin to the choice of orthonormal basis in LSMC, in our 
experience they play only a marginal role in performance of the overall pricing method. As a rule 
of thumb, we suggest to use either the Matern-5/2 kernel or the squared-exponential kernel. In 
both cases, the respective function spaces are smooth (twice-differentiable for Matern-5/2 family 
(3.4) and (7°° for the squared exponential kernel), and lead to similar fits. The performance 
is more sensitive to the kernel hyper-parameters, which are typically estimated in a frequentist 
framework, but could also be inferred from a Bayesian prior, or even be guided by heuristics. 

Remark 3.1. One may combine a kriging model with a classical least squares regression on a set 
of basis functions {Br}. This is achieved by taking f{x) = t{x) + f{x) where t{x) = PrBr{x) 
is a trend term as in (2.12), j3r’s are the trend coefficients to be estimated, and f is the “residual” 
mean-zero Gaussian process. The Universal Kriging formulas, see [34] or [45, Sec. 2.7], then allow 
simultaneous computation of the kriging surface and the OLS coefficients (3. For example, one 
may use the European option price, if one is explicitly available, as a basis function to capture 
some of the structure in C{t, •). 

Inference on the kriging hyperparameters requires knowledge of the sampling noise a‘^{x). In¬ 
deed, while it is possible to simultaneously train K, and infer a constant simulation variance ci^ (the 
latter is known as the “nugget” in the machine learning community [38]), with state-dependent 
noise K, is not identifiable. Batching allows to circumvent this challenge by providing empirical 
estimates of <t^(x). For each site x, we thus sample M independent replicates y^^\x),... ,y^^\x) 
and estimate the conditional variance as 

, M M 

(3.7) a‘^{x):= jjr^Y^{y^^\x)-y{x)f, where y(x) = — ^ yW(x), 

i=\ i=l 

is the batch mean. We then use u^(a;) as a proxy to the unknown cj^(x) to estimate 1C. One 
could further improve the estimation of cr^(-) by fitting an auxiliary kriging meta-model from 
?^’s, cf. [25, Ch 5.6] and references therein. As shown in [33, Sec 4.4.2], after fixing K, we can 
then treat the M samples at x as the single design entry {x,y{x)) with noise variance a‘^{x)jM. 
The end result is that the batched dataset has just N' := N/M rows {x,y)^''^ that are fed into 
the kriging meta-model. 

For our examples below, we have used the R package DiceKriging [37]. The software takes the 
input locations x^'^', the corresponding simulation outputs y^'^' and the noise levels (t‘^{x^'^'), as 
well as the kernel family (Matern-5/2 (3.4) by default) and returns a trained kriging model. One 
then can utilize a predict call to evaluate (3.5)-(3.6) at given x's. The package can also implement 
universal kriging. Finally, in the cases where the design is not batched or the batch size M is very 
small (below 20 or so), we resort to assuming homoscedastic noise level that is estimated as 
part of training the kriging model (an option available in DiceKriging). The advantage of such 
plug-and-play functionality is access to an independently tested, speed-optimized, debugged, and 
community-vetted code, that minimizes implementation overhead and future maintenance (while 
perhaps introducing some speed ineffiencies). 

Remark 3.2. The kriging framework includes several well-developed generalizations, including 
treed GP’s [18], local GP’s [15], monotone GP’s [36], and particle-based GP’s [17] that can be 
substituted instead of the vanilla method described above, similar to replacing plain OLS with 
more sophisticated approaches. All the aforementioned extensions can be used off-the-shelf through 
public R packages described in the above references. 

3.2. Approximation Quality. To quantify the accuracy of the obtained kriging estimator C{t, •), 
we derive an empirical estimate of the loss function Lrmc- To do so, we integrate the posterior 
distributions Aixf) ~ N{m{x),v‘^{x)) vis-a-vis the surrogate means that are proxies for C{t,-) 
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using (2.14); 


(3.8) 


• I \y ^(^) ®) I l{m(a;)</i(t,a;)<j/U y<h(t,x)<m{x)}-^^x{dy^ 

JR 

1 / {m{z) - h{t,z))‘^' 


lo \/^v{z) 


exp 


2v^{z) 


dz 


/ N , / —\m(x) — hit, x)\\ , , . ,, ^ f— \m(x) — h(t,x)\ 

= v{x)(i)[ ' -|m(x)-/t(t,x)|4>' ' ^ 


v{x) 


v{x) 


where $,0 are the standard Gaussian cumulative/probability density functions. The quantity 
P({m(x) < h{t,x) < C{t,x)}[j{C{t,x) < h{t,x) < m{x)} | Q) is precisely the Bayesian posterior 
probability of making the wrong exercise decision at {t, x) based on the information from H, so 
lower indicates better performance of the design V in terms of stopping optimally. 

Integrating £rmc{') over X then gives an estimate for Lrmc{C)' 


(3.9) LRMciC;T>) := f iiiMcix',d^)p{t,x\0,XQ) dx. 

Jx 

Practically, we replace the latter integral with a weighted sum over T>, LjiMciC]'D) — n 
En^RMcix^;T^)pit,x^\0,Xo). 


3.3. Further RKHS Regression Methods. From approximation theory perspective, kriging 
is an example of a regularized kernel regression. The general formulation looks for the minimizer 
/ G of the following penalized residual sum of squares problem 

N 

(3.10) RSSif, A) = - /(^")}' + All/111,, 

n=l 

where A > 0 is a chosen smoothing parameter and R is a RKHS. The summation in (3.10) is a 
measure of closeness of / to data, while the right-most term penalizes the fluctuations of / to 
avoid over-fitting. The representer theorem implies that the minimizer of (3.10) has an expansion 
in terms of the eigen-functions 

N 

(3.11) f{x) = y~]an^(x,x"^), 

n=l 

relating the prediction at x to the kernel functions based at the design sites x^'^; compare to 
(3.5). In kriging, the function space is Ric, A = 1/2, and the corresponding norm || • ||.„ has a 
spectral decomposition in terms of differential operators [45, Ch. 6.2]. 

Another popular version of (3.10) are the smoothing (or thin-plate) splines that take ICtps{x, x') 
11X — x' I p log 11X — x' 11, where 11 • 11 denotes the Euclidean norm in In this case, the RKHS Rtps 
is the set of all twice continuously-differentiable functions [19, Chapter 5] and 


(3.12) 


2 

Htps 


/[' 

jR'i 


d 

E 

*j=i 


d__d^ 

dxi 5xi 


f{x) 


dx. 


This generalizes the one-dimensional penalty function ||/|| = f^{f”{x)}‘^dx. Note that under 
A = oo and (3.12) the optimization in (3.10) reduces to the traditional least squares linear fit 
/(x) = /3o + j3ix since it introduces the constraint f"{x) = 0. A common parametrization for 
the smoothing parameter A is through the effective degrees of freedom statistic df^; one may 
also select A adaptively via cross-validation or MLE [19, Chapter 5]. The resulting optimization 
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of (3.10) based on (3.12) gives a smooth response surface which is called a thin-plate spline 
(TPS), and has the explicit form 

d N 

(3.13) f{x) = Po + ^ f3jXj + ^ C(n\\x — log ||x — x"'||. 

j=l n=l 

See [26] for implementation of RMC via splines. 

A further RKHS approach, applied to RMC in [41] is furnished by the so-called (Gaussian) 
radial basis functions. These are based on the already mentioned Gaussian kernel ICrbf{x,x') = 
exp(—0jjx — x'jj^), but substitute the penalization in (3.10) with ridge regression, reducing the 
sum in (3.11) to N' N terms by identifying/optimizing the “prototype” sites x„/. 


4. Designs for Kriging RMC 

Fixing the meta-modeling framework, we turn to the problem of generating the experimental 
design V = x^'^. We work in the fixed-budget setting, with a pre-specified number of simulations 
N to run. Optimally constructing the respective experimental design dW) requires solving a 
Wdimensional optimization problem, which is generally intractable. Accordingly, in this section 
we discuss heuristics for generating near-optimal static designs. Section 5 then considers another 
work-around, namely sequential methods that utilize a divide-and-conquer approach. 

4.1. Space Filling Designs. A standard strategy in experimental design is to spread out the 
locations x^'^ to extract as much information as possible from the samples ■ A space-filling 
design attempts to distribute x^'s evenly in the input space A, here distinguished from the 
theoretical domain A of (Xf). While this approach does not directly target the objective (2.14), 
assuming A is reasonably selected, it can still be much more efficient than traditional LSMG 
designs. Examples of space-filling approaches include maximum entropy designs, maximin designs, 
and maximum expected information designs [25]. Another choice are quasi Monte Garlo methods 
that use a deterministic space-filling sequence such as the Sobol for generating x^'^. The simplest 
choice are gridded designs that pick x^'^ on a pre-specified lattice, see Figure 2 below. Quasi¬ 
random/low-discrepancy sequences have already been used in American option pricing but for 
other purposes; in [7] for approximating one-step value in a stochastic mesh approach and in [9] 
for generating the trajectories of (W). Here we propose to use quasi-random sequences directly 
for constructing the simulation design; see also [46] for using a similar strategy in a stochastic 
control application. 

One may also generate random space-filling designs which is advantageous in our context of 
running multiple response surface models (indexed by t) by avoiding any grid-ghosting effects. 
One flexible procedure is Latin hypercube sampling (LHS) [32]. In contrast to deterministic 
approaches that are generally only feasible in hyper-rectangular A’s, LHS can be easily combined 
with an acceptance-rejection step to generate space-filling designs on any finite subspace. 

Typically the theoretical domain A of (W) is unbounded (e.g. M!^) and so the success of a 
space-filling design is driven by the judicious choice of an appropriate A. This issue is akin 
to specifying the domain for a numerical PDE solver and requires structural knowledge. For 
example, for a Bermudan Put with strike K, we might take A = [0.6K,K], covering a wide 
swath of the in-the-money region, while eschewing out-of-the-money and very deep in-the-money 
scenarios. Ultimately, the problem setting determines the importance of this choice (compare the 
more straightforward setting of Section 6.3 below vs. that of Section 6.2 where finding a good A 
is much harder). 
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4.2. Probabilistic Designs. The original solution of Longstaff and Schwartz [30] was to con¬ 
struct the design V using the conditional density of Xt\XQ, i.e. to generate N independent draws 
x” ~ p{t, -jo, Xq), n = 1,..., N. This strategy has two advantages. First, it permits to implement 
the backward recursion for estimating Gt-i, Gt-2, ■ ■ ■ on a fixed global set of scenarios which 
requires just a single (albeit very large) simulation and hence saves simulation budget. Second, 
a probabilistic design automatically adapts to the dynamics of (Xt), picking out the regions that 
Xt is likely to visit and removing the aforementioned challenge of specifying X. 

Moreover, empirical sampling of x” reflects the density of Xt which helps to lower the weighted 
loss (2.14). Assuming that the boundary d&t is not in a region where p(f, •|0,Xo) is very low, 
this generates a well-adapted design, which partly explains why the experimental design of LSMC 
is not entirely without its merits. Compared to a space-filling design that is sensitive to X, a 
probabilistic design is primarily driven by the initial condition Xq. This can be a boon as above, 
or a limitation; for example with OTM Put contracts, the vast majority of empirically sampled 
sites x” would be out-of-the-money, dramatically lowering the quality of an empirical P. A good 
compromise is to take x" ~ Xt\{Xo, h{t, Xt) > 0} i.e. to condition on being in-the-money. 

4.3. Batched Designs. As discussed, the design sites x^’'^ need not be distinct and the macro¬ 

design can be replicated. Such batched designs offer several benefits in the context of meta¬ 
modeling. First, averaging of simulation outputs dramatically improves the statistical properties 
of the averaged y compared to the raw y’s. In most cases once M » 10, the Gaussian assumption 
regarding the respective noise e becomes excellent. Second, batching raises the signal-to-noise 
ratio and hence simplifies response surface modeling. Training the kriging kernel /C with very 
noisy samples is difficult, as the likelihood function to be optimized possesses multiple local max¬ 
ima. Third, as mentioned in Section 3.1, batching allows empirical estimation of heteroscedastic 
sampling noise Algorithm 1 summarizes the steps for building a kriging-based RMC solver 

with a replicated design. 

Batching also connects to the general bias-variance tradeoff in statistical estimation. Plain 
(pointwise) Monte Carlo offers a zero bias / high variance estimator of /(x), which is therefore 
computationally expensive. A low-complexity model such as parametric least squares (2.12), offers 
a low variance/high bias estimator (since it requires constraining / G T~Lr). Batched designs cou¬ 
pled with kriging interpolate these two extremes by reducing bias through flexible approximation 
classes, and reducing variance through batching and modeling of spatial response correlation. 

Remark 4.1. The batch sizes M can he adaptive. By choosing M(x) appropriately one can set 
d{x) = Var{Y{x)) = d^(x)/M(x) to any desired level. In particular, one could make all averaged 
observations homoscedastic with a pre-specified intrinsic noise d. The resulting regression model 
for y would then conform very closely to classical homoscedastic Gaussian assumptions regarding 
the distribution of simulation noise. 

As the number of replications M becomes very large, batching effectively eliminates all sampling 
noise. Consequently, one can substitute the empirical average y{x) from (3.7) for the true /(x), 
an approach first introduced in [43]. It now remains only to interpolate these responses, reducing 
the meta-modeling problem to analysis of deterministic experiments. There is a number of highly 
efficient interpolating meta-models, including classical deterministic kriging (which takes S = 0 in 
(3.5)-(3.6)), and natural cubic splines. Assuming smooth underlying response, deterministic meta¬ 
models often enjoy very fast convergence. Application of interpolators offers a new perspective 
(which among others is nicer to visualize) on the RMC meta-modeling problem. To sum up, 
batching offers a tunable spectrum of methods that blend smoothing and interpolation of /(•). 

Figure 2 illustrates the above idea for a 1-D Bermudan Put in a CBM model, see Section 4.4. 
We construct a design T> of macro size 9600 = 1600 • 6 which uses just six distinct sites T>’ = x^''^ , 
and M = 1600 replications at each site. We then interpolate the obtained values y(x^'®) using (i) 

12 



Algorithm 1 Regression Monte Carlo for Optimal Stopping using Kriging 

Require: No. of simulations N, no. of replications M, no. of time-steps T 
1: N' ^ N/M 
2: Set &T ^ ^ 

3: for t = T — 1,T — 2, ...,1 do 
4: Generate a macro-design Dt = ^ 

5: Use the stochastic sampler Y{x) = Ht{X ) in (2.8) with M replications per x G 

6: Batch the replications according to (3.7) to obtain the averaged (x, y)^'^ 

7: Fit a kriging meta-model C{t, •) based on (x, 

8: Obtain stopping set &t from (2.11) 

9: (Or replace steps 4-8 with a sequential design sub-problem using Algorithm 2 below) 

10: end for 

11: (Generate fresh out-of-sample (Xq, and approximate U(0, Xq) using (2.9)) 

12: return Estimated stopping sets ©i:T 


a deterministic kriging model; and a (ii) natural cubic spline. We observe that the resulting fit for 
T{t, •) shown in the Figure looks excellent and yields an accurate approximation of the exercise 
boundary. 



S. 


t 


Figure 2. Experimental design and estimated timing value T{t,-) using (i) deter¬ 
ministic kriging and (ii) a cubic spline interpolator. We use the manually chosen 
macro design V' = = {30,32,34,35,36,38}, with M = 1600 replications 

at each site. The boxplots summarize the resulting distribution of y(™'^(x")’s, 
m = 1,..., M. The dots indicate the batch means y’s which are interpolated ex¬ 
actly by the two meta-models. The underlying Bermudan Put is as in Section 4.4. 


4.4. Illustration. To illustrate different simulation designs we revisit the classical example of a 
1-D Bermudan Put option in a Geometric Brownian motion model. More extensive experiments 
are given in Section 6. The log-normal X-dynamics are 

(4.1) Xt+M = Xt exp (^(r - 5 - , AWt ~ X(0, At), 

and the Put payoff is h(t,x) = e“^*(X — x)+. The option matures at T = 1, and we assume 25 
exercise opportunities spaced out evenly with At = 0.04. This means that with a slight abuse of 

13 


































notation, the RMC algorithm is executed over t = T,T — At,..., At, 0. The rest of the parameter 
values are interest rate r = 0.06, dividend yield <5 = 0, volatility a = 0.2, and at-the-money strike 
K = Xq = 40. In this case 14(0 ,Xq) = 2.314, cf. [30]. In one dimension, the stopping region for 
the Bermudan Put is an interval [0, s(t)] so there is a unique exercise boundary s{t) = dGf 

We implement Algorithm 1 using a batched LHS design T) with N = 3000, M = 100, so at each 
step there are N' = N/M = 30 distinct simulation sites . The domain for the experimental 
designs is the in-the-money region X = [25,40]. To focus on the effect of various P’s, we freeze 
the kriging kernel JC across all time steps t as a Matern-5/2 type (3.4) with hyperparameters 
s = 1,0 = 4, which were close to the MLE estimates obtained. This choice is maintained 
for the rest of this section, as well as for the following Figures 4-5. Our experiments (see in 
particular Table 3) imply that the algorithm is insensitive to the kernel family or the specific 
software/approach used for training the kriging kernels. 

Figure 3 illustrates the effect of various experimental designs on the kriging meta-models. We 
compare three different batched designs with a fixed overall size = |P| = 3000: (a) an LHS 
design with small batches M = 20; (b) an LHS design with a large M = 100, and (c) a probabilistic 
density-based design also with M = 100; recall that the latter generates N/M = 30 paths of (Xf) 
and then creates M sub-simulations initiating at each x^. The resulting {xt, ytY'^' at t = 0.6 are 
shown in the top panels of Figure 3, along with the obtained estimates T{t,-). The middle row of 
Figure 3 shows the resulting surrogate standard deviations v{-). The shape of x i— v{x) is driven 
by the local density of the respective V, as well as by the simulation noise cr^(x). Here, a'^{x) 
is highly heteroscedastic, being much larger near at-the-money x ~ 40 than deep ITM. This is 
because in-the-money one is likely to stop soon and so the variance of Ht{X^) is lower. For LHS 
designs, the roughly uniform density of T> leads to the posterior variance being proportional to 
the simulation noise, v^{x) oc (T^(x); on the other hand the empirical design reflects the higher 
density of Xt closer to Xq = 40, so that v{x) is smallest there. Moreover, because there are very 
few design sites for x < 32 (just 5 in Figure 3), the corresponding surrogate variance has the 
distinctive hill-and-valley shape. In all cases, the surrogate variance explodes along the edges of 
"D, reflecting lack of information about the response there. 



Figure 3. Three different designs for fitting a kriging metamodel of the contin¬ 
uation value for the 1-D Bermudan Put {t = 0.6, T = 1). Top panels show the 
fitted T{t, •) as well as the distinct design sites x^'^'. Middle panels plot the corre¬ 
sponding surrogate standard deviation vix). Bottom panels display the loss metric 
l{x-,V) from (3.8). 
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The bottom panels of Figure 3 show the local loss metric for the corresponding 

designs. We stress that (rmc{') is a function of the sampled , and hence will vary across 
algorithm runs. Intuitively (.rmc{') is driven by the respective surrogate variance v‘^{x), weighted 
in terms of the distance \m{x) — h{t,x)\ to the stopping boundary. Consequently, large v‘^{x) is 
fine for regions far from the exercise boundary. Overall, we can make the following observations: 

(i) Replicating simulations does not materially impact surrogate variance at any given x' ^ P, 
and hence makes little effect on ^rmc{x')- Consequently, there is minimal loss of fidelity 
relative to using a lot of distinct design sites in V. 

(ii) The modeled response surfaces tend to be smooth. Consequently, the kernel K, should 
enforce long-range spatial dependence (sufficiently large lengthscale) to make sure that 

is likewise smooth. Undesirable fluctuations in can generate spurious stop¬ 

ping/continuation regions, see e.g. the left-most panel in Figure 3 around x = 28. 

(iii) Probabilistic designs tend to increase Lrmc{C) because they fail to place enough design 
sites deep ITM, leading to extreme posterior uncertainty v‘^{x) there, cf. the right panel 
in the Figure. 

(iv) Due to very low signal-to-noise ratio that is common in financial applications, replication 
of simulations aids in seeing the “shape” of (^(t, •) and hence simplifies the meta-modeling 
task, see point (ii). 

To give a sense regarding the aggregate quality of the meta-models. Figure 4 shows the estimated 
T{t, •) at three different steps t as the RMC implements backward recursion. The overall shape 
of T{t, •) remains essentially the same, being strongly positive close to the strike K = 40, crossing 
zero at s and remaining slightly negative deep ITM. As t decreases, the exercise boundary also 
moves lower. The Figure shows how the cross-sectional borrowing of information (modeled by 
the CP correlation kernel /C) reduces the surrogate variance v‘^{x) compared to the raw a‘^{x) 
(cf. the corresponding 95% CIs). We also observe that there is some undesirable “wiggling” of 
T{t, •), which can generate spurious exercise boundaries such as in the extreme left of the t = 0.2 
panel in Figure 4. However, with an LHS design this effect is largely mitigated compared to the 
performance of the empirical design in Figure 3. This confirms the importance of spacing out the 
design D. Indeed, for the LHS designs the phenomenon of spurious exercising only arises for t ~ 0 
and X very small, whereby the event {Xt < 30} that would lead to the wrong exercise strategy 
has negligible probability of occurring. 

5. Sequential Designs 

In this section we discuss adaptive designs that are generated on the fly as the meta-model 
learns Sequential design captures the intuition of focusing the sampling efforts around the 

exercise boundary d&t, cf. the loss function (2.14). This makes learning the response intrinsic 
to constructing the experimental design. In particular, Bayesian procedures embed sequential 
design within a dynamic programming framework that is naturally married to statistical learning. 
Algorithmically, sequential design is implemented at each time step t by introducing an additional 
loop over k = A^O; ■ ■ ■ ,N' that grows the experimental designs now indexed by their size k 
{k is the number of distinct sites, with batching possibly applied in the inner loop as before). As 
the designs are augmented, the corresponding surrogate surfaces AT ' and stopping regions 6) ^ 
are re-estimated and refined. The hnal result is a sequence of adaptive designs ■ ■ ■ 

and corresponding exercise policy. 

5.1. Augmenting the Experimental Design. Fixing a time step t, the sequential design loop 
is initialized with generated for example using LHS. New design sites x^°^^ ^ . 

are then iteratively added by greedily optimizing an acquisition function E{x) that quantifies 
information gains via Expected Improvement (El) scores. The aim of El scores is to identify 
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Figure 4. Evolution of the estimated T{t,-) over the backward induction steps. 
We show the snapshots at t = 0.6,0.4,0.2. LHS designs T> of size N = 3000 with 
M = 100 replications. The vertical “error” bars indicate the 95% quantiles of the 
simulation batch i = 1,..., M at x"’, n = 1,..., W (based on (3.7)); the dots 
show the batch means y, while the dotted lines indicate the 95% credibility interval 
(Cl) of the kriging meta-model 


locations x which are most promising in terms of lowering the global loss function L^MciC) 
from (2.14). In our context the El scores Ek{x) are based on the posterior distributions A4^\-) 
which summarize information learned so far about C{t, •). The seminal works of [12, 31] showed 
that a good heuristic to maximize learning rates is to sample at sites that have high surrogate 
variance or high expected surrogate variance reduction, respectively. Because we target (2.14), a 
second objective is to preferentially explore regions close to the contour {C{t,x) = h{t,x)}. This 
is achieved by blending the distance to the contour with the above variance metrics, in analogue 
to the Efficient Global Optimization approach [22] in simulation optimization. 

A greedy strategy augments with the location that maximizes the acquisition function: 

(5.1) x^^^ = arg sup^g;t> Ek{x). 

Two major concerns for implementing (5.1) are (i) the computational cost of optimizing over 
A and (ii) the danger of myopic strategies. Eor the first aspect, we note that (5.1) introduces 
a whole new optimization sub-problem just to augment the design. This can generate substan¬ 
tial overhead that deteriorates the running time of the entire RMC. Consequently, approximate 
optimality is often a pragmatic compromise to maximize performance. Eor the second aspect, 
myopic nature of (5.1) might lead to undesirable concentration of the design E interfering with 
the convergence of •) —?• C{t, •) as k grows. It is well known that many greedy sequential 

schemes can get trapped sampling in some subregion of A, generating poor estimates elsewhere. 
Eor example, if there are multiple exercise boundaries, the El metric might over-prefer established 
boundaries, and risk missing other stopping regions that were not yet found. A common way to 
circumvent this concern is to randomize (5.1), which ensures that E grows dense uniformly on A. 
In the examples below we follow [16] and replace arg supj,g_:^^ in (5.1) with argmaxa;g 7 - where 7~ 
is a finite candidate set, generated using LHS over some domain A again. LHS candidates ensure 
that potential new x^~^^ locations are representative and well spaced out over A. 

Kriging meta-models are especially well-suited for sequential design thanks to availability of 
simple updating formulas that allow to efficiently assimilate new data points into an existing ht. 
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If a new sample is added to an existing design and keeping the kernel K, 

fixed, the surrogate mean and variance at location x are updated via 


(5.2) 

(5.3) 


rrS^+^\x) = m^^\x) + \{x, x^^^)(/+i - m^^\x^+^))- 

^(fc+i)(3.)2 ^ _ \{x,x^+^-x^'-^f[a‘^{x^+^)+v^^\x^+^)\ 


where A(x, x^'^) > 0 is a weight function (that is available analytically) specifying the 
influence of the new x^'’'^-sample on x, conditional on existing design locations x^'^. Note that 
(5.2) and (5.3) only require the knowledge of the latest surrogate mean/variance and x^'^; previous 
simulation outputs do not need to be stored. Moreover, the updated surrogate variance 
.y(A:+i)(x)2 deterministic function of x^^^ which is independent of y^~^^■ In particular, the local 
reduction in surrogate standard deviation at x^^^ is proportional to the current u*^^)(x^+^) [II]: 


(5.4) 




1 - 


a{x‘ 


k+l) 




Remark 5.1. In principle, the hyperparameters of the kernel ought to he re-estimated as 
more data is collected, which makes updating C(t, •) non-sequential. However, since we expect the 
estimated to converge as k grows, it is a feasible strategy to keep K, frozen across sequential 
design iterations, allowing the use of (5.2)-(5.3). 


Algorithm 2 summarizes sequential design for learning C{t,-) at a given time-step f; see also 
Algorithm I for the overall kriging-based RMC approach to optimal stopping. 


Algorithm 2 Sequential design for (2.14) using stochastic kriging 

Require: Initial design size Nq, final size N, acquisition function E{-) 

1: Generate initial design := of size Nq using LHS over X 

2: Sample y^-^o initialize the response surface model 
3: Construct the stopping set using (2.11) 

A: k <— Nq 
5: while k < N do 

6: Generate a new candidate set of size D using LHS 

7: Compute the expected improvement Ek{x) for each x G T 

8: Pick a new location x^+^ = argmax^g.^(fe) Ej.{x) and sample the corresponding 

9: (Optional) Re-estimate the kriging kernel /C 

10: Update the surrogate surface using (5.2)-(5.3) 

11: Update the stopping set using (2.11) 

12: Save the overall grid U 

13: k ^ k-fl 

14: end while 

15: return Estimated stopping set and design 


Remark 5.2. The ability to quickly update the surrogate as simulation outputs are collected 
lends kriging-models to “online” and parallel implementations. This can he useful even without 
going through a full sequential design framework. For example, one can pick an initial budget N, 
implement RMC with N simulations, and if the results are not sufficiently accurate, add more 
simulations without having to completely restart from scratch. Similarly, batching simulations is 
convenient for parallelizing using GPU technology. 
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5.2. Acquisition Functions. In this section we propose several acquisition functions E{x) to 
guide the sequential design sub-problem (5.1). Throughout we are cognizant of the loss function 
(2.14) that is the main criterion for learning C{t, •). 

One possible proposal for an El metric is to sample at locations that have a high (weighted) 
local loss dehned in (3.8), i.e. 

(5.5) Ef^(x) ;=^('=)(x)-p(t,x|0,Ao). 


The superscript ZC stands for zero-contour, since intuitively E^^{x) measures the weighted dis¬ 
tance between x and exercise boundary, see [16]. E^^ targets regions with high v^^\x) or close 
to the exercise boundary, |m*^^^(x) — h{t, x)] ~ 0. 

A more refined version, dubbed ZC-SUR, attempts to identify regions where i^^\x) can be 
quickly lowered through additional sampling. To do so, ZC-SUR incorporates the expected dif¬ 
ference E[£(x; 22^*^^) — 2(x;P(^^ U x)] >0 which can be evaluated using the updating formulas 
(5.2)-(5.3). This is similar in spirit to the stepwise uncertainty reduction (SUR) criterion in 
stochastic optimization [34]. Plugging-in (2.14) we obtain 


(5.6) 

EI^C-SURf^x) ■=p{t,x\0,Xo) • |x(^)(x)(?i ^ uW(l')^ ) 


— (p 


( -dW(x) \ 

^u(^+l)(x) J 


-d^^\x) 


$ 


-d^^\x) \ _ / -d(^)(x) \l 1 
x(^)(x) / \x(^+^)(x)/ y 


d^^\x) := |m^^^(x) — h{t,x)\. 


Figure 5 illustrates the ZC-SUR algorithm in a 2D Max-Call setting of Section 6.2. The left 
panel shows the initial Sobol macro design of Nq = 50 sites. The right panel shows the 

augmented design At each site, a batch of M = 80 replications is used to reduce intrinsic 

noise. We observe that the algorithm aggressively places most of the new design sites around the 
zero-contour of ©t, primarily lowering the surrogate variance (i.e. maximizing accuracy) in that 
region. 



Figure 5. Sequential learning of the exercise boundary dGt using a ZC-SUR 
expected improvement criterion (5.6) for generating T). We show two intermediate 
designs along with corresponding boundaries The underlying model is 

the 2D Max Call from Section 6.2. 
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The empirical integrated local loss Lrmc from (3.9) provides an indicator of algorithm perfor¬ 
mance by quantifying the accuracy of C. This is particularly relevant for the sequential design 
approach, where one is able to track Lrmc{C ^^'>as the designs are augmented. Figure 

6 shows the evolution of := as function of the macro-design size k for 

''(k) 

two different time steps t and the same 2D Max-Call contract. We observe that L) ^ is generally 
decreasing in A:; the SUR criterion intuitively makes the latter a random walk with negative drift 

given by Also, we observe that approximation errors are larger for later time steps, 

'^(k) 

i.e. t I—)• L) Ms increasing. This is because Dt is based on the underlying distribution of W; for 
larger t the volume of the effective support of the latter grows, lowering the density of the design 
sites. This increases surrogate variance v‘^{x) and in turn raises the local loss in 

(3.8). 

While the values of L\ ^ are not sufficient on their own to yield a confidence interval for 
the option price (due to the complex error propagation), self-assessment allows the possibility 

of adaptive termination of Algorithm 1. For example, one could implement Algorithm 1 until 
'^(k) 

Li < Tol for a pre-specified tolerance level. From above discussion, this would lead to larger 
experimental designs for t close to T which in particular would counteract error propagation. 



Figure 6. Evolution of for the 2D Max-Call problem of Section 6.2. We 

'^(k) 

plot the empirical integrated loss L) ' at t = 1 and t = 2, as well as the average 
Ave{L\ ) across 30 runs of Algorithms 1-2 to illustrate Monte Carlo variance and 
the average convergence rate of integrated loss. The sequential design utilized the 
ZC-SUR acquisition function (5.6). 


6. Numerical Experiments 

We proceed to implement our RMC methods on three different types of Bermudan options in 
dimensions d = 2,3,5. Our examples are based on previously published parameter sets, allowing 
direct comparison of relative accuracy. To benchmark our approach, we compare against two 
implementations of the conventional LSMC, namely global polynomial bases, and local piecewise- 
linear bases from the method of Bouchard and Warin [6]. The latter consists of sub-dividing X 
into equi-probable rectangular sub-domains and separately regressing against {x^,... ,x'^} in 
each cell. This strategy does not require user-specified basis function selection for (2.12), although 
it is sensitive to the chosen number of sub-domains. In each case study, to enable “apples-to- 
apples” comparison of different RMC methods, all approximations of So:T are evaluated on a 
fixed out-of-sample set of Nout = 100,000 paths of (Xt). 
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6.1. Benchmarked Examples. While our first visualizations were in 1-D, to offer more realistic 
setups we use as benchmarks two-dimensional and 3-dimensional models. In both cases we work 
with independent and identically distributed geometric Brownian motion factors X^,... ,X^. As 
a first example we consider a two-dimensional basket Put where each asset follows the 

GBM dynamics (4.1) under same parameters as in Section 4.4, namely r = 0.06, (5 = 0 ,cj = 0.2, 
T = 1, At = 0.04. The payoff is h{t,x) = e~^^{K — ^ )-|- with A = 40 and Aq = (40,40), 

leading to option value of E(0, Aq) = 1.461. For our second example, we consider a three- 
dimensional rainbow or max-Call payoff h{t,x) = e“'’*(max(x^,x^) — A)_|_. The parameters 
are from Andersen and Broadie [2]: r = 0.05,5 = 0.1, ct = 0.2, Aq = (90,90,90), K = 100, T = 3, 
At = 1/3, with option value of F(0, Aq) = 11.25. 

As a first step, Tables 1-2 show the impact of different experimental designs. We concentrate 
on the space-filling designs, evaluating the randomized LHS design, as well as two low-discrepancy 
sequences (Halton and Sobol), varying the batch sizes M. For completeness, we also compare 
them against the Probabilistic density-based design, and a Sequential (namely ZC-SUR) DoE 
strategies. For the basket Put (see Table 1), the designs operate with the triangular input space 
X = [25, 55]^ n {(xi -|- X2)/2 < A} which covers the relevant in-the-money region of the contract 
and we use M G {4, 20,100} (the QMC sequences were generated in a cube and then restricted 
to the triangular subset). For the max-Call (see Table 2), the designs are on the rectangular 
region X = [50,150]^ with M G {25,80}. In both examples we purposely choose a very limited 
simulation budget (A = 3,000 for the 2-D Put and N = 16,000 for the 3-D Call) to accentuate 
the difference between the methods. 

Not surprisingly, fully capturing the shape of the exercise boundary with just a handful of 
design sites (even in the idealized setting where one obtains a noise-free sample of C{t,x) at an 
optimal collection x^''^ ) is impossible. Nevertheless, compared to having tens of thousands of 
design sites in LSMC, being able to get away with a couple hundred macro-sites A' < 200 is 
almost magical. In this example we also see that the probabilistic design performs very well, 
partly because the Put is ATM originally. At the same time the performance of an LHS design is 
a little bit off, and the two low-discrepancy sequences perform essentially the same. We stress that 
differences of less than 0.1 cents in V (0, Aq) are essentially negligible, so that the main conclusion 
from Table 1 is that the precise structure of a space-filling DoE design seems to play only a minor 
role. 

Relative to conventional LSMC, the more sophisticated kriging meta-model introduces signif¬ 
icant regression overhead. In standard RMC, about 95% of simulation time is spent on path- 
generation and the rest on regression; in our approach this allocation is turned on its head with 
> 90% of time devoted to regression. This becomes even more extreme in sequential methods, 
where the time cost of simulations becomes essentially negligible. The regression cost arises due to 
the 0(A'^) complexity of making predictions (3.5) with the kriging model; as the size A' = A/M 
of the macro design increases, this becomes the dominant expense of the whole algorithm. As a 
result, the running time of our method is very sensitive to M. For comparison, at M = 100, a 
single run for the 2-D Put problem took 8 seconds in our computing environment, M = 20 took 
20 seconds, and M = 4 took 190 seconds. The situation is even more acute for the sequential DoE 
that is an order of magnitude slower; for that reason we have not run M = 4 (which takes many 
minutes) in that case. The above facts indicate that batching is essential to adequate performance 
of the kriging RMC. A full investigation into the optimal amount of batching (the choice of M 
in (3.7)) is beyond the scope of this paper. However, based on our experience, we suggest that 
M ~ 50 would be appropriate in a typical financial context, offering a good trade-off between 
computational speed-up from replication and maintaining an adequate number of distinct design 
sites. 
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Design/Batch Size 


M = 4 

M = 20 

M = 100 

Probabilistic 

1.458 

(0.002) 

1.448 

(0.003) 

1.443 

(0.006) 

LHS 

1.453 

(0.002) 

1.446 

(0.004) 

1.416 

(0.033) 

Sobol QMC 

1.454 

(0.002) 

1.448 

(0.002) 

1.454 

(0.002) 

Halton QMC 

1.456 

(0.003) 

1.447 

(0.003) 

1.452 

(0.003) 

Sequential ZC-SUR 


A/A 

1.438 

(0.003) 

1.450 

(0.003) 


Table 1. Performance of different DoE approaches to RMC in the 2-D Bermu¬ 
dan Put setting of Section 6.1. All methods utilize |'Dt| = 3000. Results are 
averages (standard deviations) of 100 runs of each method, evaluating P(0, Aq) on 
a fixed out-of-sample database of N^ut = 10^ scenarios. For comparison, LSMC- 
BWll algorithm yielded estimates of Xq) = 1.431 with N = 10,000 and 

yBM/ii(o,Xo) = 1.452 with N = 50,000. 


Design/Batch Size 

M = 25 

M = 80 

Probabilistic 

11.115 

(0.015) 

11.128 

(0.015) 

LHS 

11.071 

(0.025) 

11.110 

(0.029) 

Sobol QMC 

11.131 

(0.020) 

11.177 

(0.015) 

Halton QMC 

11.120 

(0.017) 

11.183 

(0.017) 

Sequential ZC-SUR 

11.160 

(0.025) 

11.147 

(0.014) 


Table 2. Performance of different DoE approaches to RMC in the 3-D Bermudan 
Max-Call setting of Section 6.1. All methods utilize \'Dt\ = 16,000. Results are 
averages (standard deviations) of 100 runs of each method, with V (0, Aq) estimated 
from a fixed out-of-sample database of Nout = 10® scenarios. For comparison, 
LSMC-BWll algorithm yielded estimates of Xq) = 11.07 with N = 10® 

and V^^^\0,Xo) = 11.12 with A = 3 • 10®. 


Table 3 illustrates the effect of choosing different kriging kernels /C; we consider the Matern- 
5/2 (3.4), Gaussian/squared-exponential (3.3) and Matern-3/2 families [37]. In each case kernel 
hyper-parameters are htted in a black-box manner by the DiceKriging software and underlying 
design was space-filling using the Sobol sequence. As expected, optimizing the kernel gives more 
accurate results, improving the previous, fixed-kernel answers in Tables 1-2 above; it also adds 
about 20% additional time. The other conclusion is that the choice of the kernel family has a 
negligible effect on performance. Hence, for the rest of the experiments we work with the default 
choice of a squared-exponential kernel. 

From the above experiments, the following guidance can be given for implementing kriging 
RMC: pick M in the range 50-100 which is sufficient to estimate reasonably well, and reduce 
macro design size A' significantly; use whatever kernel family and hyper-parameter optimization 
the software recommends. As a goodness-of-fit check, examine posterior variance n^(x) along the 
estimated stopping boundary d&t —it should not be too large. 

6.2. Scalability in State Dimension. The max-Call setting is convenient for investigating 
scalability of our approaches in the dimension d of the problem. In contrast to basket Put, the 
stopping region of a max-Call consists of several disconnected pieces. Namely, it is optimal to stop 
if exactly one asset is in-the-money X^ > K while all other assets are below the strike K. This 
generates d disconnected stopping sub-regions in dimension d. One consequence is that selecting a 
good domain X for a space-filling design, such as one of the low-discrepancy sequences, is difficult 
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Kernel 2-D Basket Put 3-D Max Call 

Matern-5/2 1.452 (0.00243) 11.191 (0.0139) 

Matern-3/2 1.450 (0.00252) 11.191 (0.0136) 

Gaussian 1.453 (0.00273) 11.172 (0.0160) 

Table 3. Performance of different kriging models for the 2D Put and the 3D Call. 
The table reports V (0, Xq) and its Monte Carlo standard deviation (in parentheses) 
across algorithm runs. Design sizes were \'D\ = 3,000 in 2-D and \'D\ = 16,000 
in 3-D and used the Sobol QMC design. Results are based on averaging 100 runs 
of each method, and evaluating K(0, Xq) on a fixed out-of-sample database of 
Xout = 100, 000 scenarios. 


since the continuation region is non-convex and the stopping region is disconnected. We use 
X = [50,150]^^ which encompasses the relevant parts of both stopping and continuation regions; 
hence, in this situation we build our metamodel for both in-the-money and out-of-the-money 
paths. The latter X is rather large, hurting the accuracy of space-filling designs. 

Following [2] we benchmark for d = 2,3,5. Table 4 shows the results from two different 
implementations of the LSMC algorithm (namely a global polynomial (“Poly”) basis, as well as the 
localized Bouchard-Warin (“BWll”) [6] scheme, and comparing them to two kriging metamodels, 
one using a space-filling design utilizing the low-discrepancy Sobol sequence (“Krig-fSobol”) and 
the other a sequential ZC-SUR design (“Krig-I-SUR”). To show the advantage of our approach, we 
use an order-of-magnitude smaller design for Krig-fSobol, and an even smaller one for the adaptive 
Krig-I-SUR method. For large d, the standard LSMC approach requires hundreds of thousands of 
X-trajectories which may impose memory constraints, so these savings are impressive. 

In terms of overall simulation budget measured by the number of 1-step simulations of X 
(see fourth column of Table 4), the use of better experimental designs and kriging meta-models 
enables a factor of 2-5 savings, which is smaller than the reduction in N because we do not 
exploit the “global grid” trick of the conventional approach. In the last five-dim. example we also 
see the limitation of the kriging model due to excessive regression overhead as the macro-design 
size N' = N/M rises. As a result, despite using less than 20% of the simulation budget, the 
overall running time, is much longer than for LSMC. With the present implementation using 
DiceKriging, this overhead crowds out simulation savings around N' > 500. Table 4 suggests 
that while attractive conceptually, the gains from a sequential design strategy are marginal in 
terms of memory, and are expensive in terms of additional overhead. Rephrasing, one can squeeze 
most of the savings via a space-filling design. This observation offers a pragmatic guidance for the 
DoE aspect, indicating that one must balance simulation-budget savings against regression/DoE 
overhead. Of course, this trade-off keeps changing as more efficient large-scale kriging/RSM 
software is developed/used (e.g. localized kriging, see [15]). 


6.3. Stochastic Volatility Examples. Lastly, we discuss stochastic volatility (SV) models. 
Because there are no (cheap) exact methods to simulate the respective asset prices, discretization 
methods such as the Euler scheme are employed. This makes this class a good case study on 
settings where simulation costs are more significant. Our case study uses the mean-reverting SV 
model with a two-dimensional state X = (X^, A^): 


dXj = rXl dt + dlU/, 

dXf = a (mi — X|) dt + u dW^ . 


where X^ is the asset price and the log-volatility factor. Volatility follows a mean-reverting 
stationary Ornstein-Uhlenbeck process with mean reversion rate a and base-level mi, with a 
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Method 


U(0,Ao) 

(StDev.) 

#Sims 

Time (secs) 

2D Max call 

LSMC Poly 

N=50,000 

7.93 

(0.018) 

3.60 • 10^ 

3.9 

LSMC BWll 

N=50,000 

7.89 

(0.023) 

3.60 • 10^ 

4.0 

Krig -|- Sobol 

N=10,000 

8.12 

(0.019) 

2.54-10^ 

5.3 

Krig + SUR 

N=6,000 

8.10 

(0.024) 

1.44 • 10^ 

10.8 

3D Max Call 

LSMC Poly 

N=300,000 

10.95 

(0.01) 

2.70 • 10*’ 

15 

LSMC BWll 

N=300,000 

11.12 

(0.01) 

2.70 • 10® 

22 

Krig -|- Sobol 

N=20,000 

11.18 

(0.02) 

0.48 • 10® 

33 

Krig + SUR 

N=16,000 

11.15 

(0.02) 

0.51 • 10® 

161 

5D Max Call 

LSMC Poly 

N=800,000 

15.81 

(0.01) 

7.20 • 10*’ 

42 

LSMC BWll 

N=640,000 

16.32 

(0.02) 

5.76 • 10® 

87 

Krig -|- Sobol 

N=32,000 

16.31 

(0.02) 

0.86 • 10® 

205 

Krig + SUR 

N=25,000 

16.30 

(0.02) 

0.66 • 10® 

666 


Table 4. Comparison of RMC methods for different max-Call models. Results 
are averages across 100 runs of each algorithm, with third column reporting the 
corresponding standard deviations of R(0,Xo). Time is based on running the R 
code on a 1.9 MHz laptop with 8Gb of RAM. The LSMC Polynomial method used 
a tensor product basis of degree 3 (in 2-D and 3-D) and degree 2 in 5-D. The 
LSMC BWll method used 10^ partitions for d = 2, 5^ partitions for d = 3 and 4® 
partitions for d = 5. Kriging models used the squared-exponential kernel. 


leverage effect expressed via the correlation d{W^= pdt. Simulations of are 

done based on an Euler scheme with a time-step 5t. We consider the asset Put h{t,x^,x‘^) = 
e~^^{K — x^)+; exercise opportunities are spaced out by At 5t. Related experiments have been 
carried out in [35], and recently in [1]. 

Table 5 lists three different parameter sets. Once again we consider the kriging meta-models 
(here implemented with LHS for a change) against LSMC implementation based on the BWll 
[6] algorithm. Table 5 confirms that the combination of kriging and thoughtful simulation design 
allows to reduce N by an order of magnitude. Compare for the first parameter set a kriging model 
with LHS design T> that uses N = 10,000 (and estimates option value at 16.06) relative to the 
LSMC model with N = 128,000 (and estimates option value of 16.05). For the set of examples 
from [1] we observe that the space-filling design does not perform as well, which most likely is 
due to having the inefficient rectangular LHS domain X = [30,50] x [—4,1]. Still, the sequential 
design method wins handily. 

Table 5 also showcases another advantage of building a full-scale metamodel for the continuation 
values. Because we obtain the full map x i—)• our method can immediately generate an 

approximate exercise strategy for a range of initial conditions. In contrast, in conventional LSMC 
all trajectories start from a fixed Xq. In Table 5 we illustrate this capability by re-pricing the Put 
from [1] after changing to an out-of-the-money initial condition Ag = 110 (keeping Ag = — 1). 
For the kriging approach we use the exact same metamodels that were built for the ITM case 
(Ag = 90), meaning the policy sets &o-t remain the same. Thus, all that was needed to obtain 
the reported Put values R(0, Aq) was a different set of test trajectories to generate fresh out-of- 
sample payoffs. We note that the obtained results (compare P^'’*®(0, Aq) = 8.27 with budget of 
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Method F(0,Xo) (StDev.) ^ of Sims Time (secs) 


Rambharat & Brockwell [35] Case SV5 
r = 0.0225, a = 0.015, m = 2.95, i/ = 3/^/2, p = -0.03 
K = 100,Xoi = 90,Xg = log0.35,T = 50/252, At = 1/252, = 1/2520 


LSMC BWll 

N=48,000 

15.89 

(0.08) 

2.50 • lO'^ 

24 

LSMC BWll 

N=128,000 

16.03 

(0.05) 

6.25 • 10® 

52 

Krig + LHS 

N=4000 

15.58 

(0.22) 

0.96 • 10® 

25 

Krig + LHS 

N=10,000 

16.06 

(0.05) 

4.25 • 10® 

155 

Krig + SUR 

N=4000 

16.24 

(0.04) 

2.32 • 10® 

283 

Krig + SUR 

N=10,000 

16.30 

(0.05) 

5.69 • 10® 

565 

Agarwal, Junejj 

1 & Sircar ITM [1] Xq = 90 



r = 0.1, a = 1, 

m = —2, ly 

= V2,p 

= -0.3 


K = 

100, Xl = 90, Xl 

= -i,r = 

1, At = 

0.05, (5t = 0.001 


LSMC BWll 

N=50,000 

14.82 

(0.03) 

1.00 • 10® 

36 

LSMC BWll 

N=125,000 

14.95 

(0.02) 

2.50 • 10® 

82 

Krig + LHS 

N=5000 

14.73 

(0.05) 

0.26 • 10® 

13 

Krig + LHS 

N=20,000 

14.81 

(0.02) 

1.05 • 10® 

85 

Krig + SUR 

N=4000 

14.91 

(0.03) 

0.25 • 10® 

50 

Krig + SUR 

N=10,000 

14.96 

(0.02) 

0.58 • 10® 

121 

Agarwal, Juneja 

& Sircar OTM [1] Xq = 110 


LSMC BWll 

N=50,000 

8.25 

(0.02) 

1.25 • 10® 

26.7 

LSMC BWll 

N=125,000 

8.30 

(0.02) 

3.12 • 10® 

67.3 

Krig + LHS 
Krig + LHS 

N=5000 

N=20,000 

8.27 

8.30 

(0.02) 

(0.01) 

N/A 

N/A 

N/A 

N/A 


Table 5. Comparison of RMC methods for Bermudan Puts under stochastic 
volatility models (6.1). Results are averages across 100 runs of each algorithm, 
with third column reporting the corresponding standard deviations of R(0, Xq). 


N = 5000 to R^'^^‘"(0, Xq) = 8.25 with N = 50,000) still beat the re-estimated LSMC solution 
by several cents. 


7. Conclusion 

We investigated the use of kriging meta-models for policy-approximation based on (2.11) for 
optimal stopping problems. As shown, kriging allows a flexible modeling of the respective contin¬ 
uation value, internally learning the spatial structure of the model. Kriging is also well-suited for 
a variety of DoE approaches, including batched designs that alleviate non-Gaussianity in noise 
distributions, and adaptive designs that refine local accuracy of the meta-model in the regions 
of interest (and demand a localized, non-parametric fit). These features translate into savings of 
50-80% of simulation budget. While the time-savings are often negative, I believe that kriging 
is a viable, and attractive regression approach for RMC. First, in many commercial-grade ap¬ 
plications, memory constraints are as important as speed constraints. Speed is generally much 
more scalable/parallelizable than memory. Second, as implemented, the main speed bottleneck 
comes from 0{{N')‘^) cost of building a kriging model; alternative formulations get around this 
scalability issue, and more speed-ups can be expected down the pike. Third, kriging brings a suite 
of diagnostic tools, such as a transparent credible interval for the fitted C that can be used to 
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self-assess accuracy. Fourth, through its analytic structure kriging is well suited for automated 
DoE strategies. 

Yet another advantage of kriging, is a consistent probabilistic framework for joint modeling of 
C and its derivatives. The latter is of course crucial for hedging: in particular Delta-hedging is 
related to the derivative of the value function with respect to x. In the continuation region we 
have dxV {t, •) = dxC{t, •), so that a natural estimator for Delta is the derivative of the meta-model 
surrogate A(t,x) = dxC{t,x), see e.g. [44, 21]. For kriging metamodels, the posterior distribution 
of the response surface derivative dxC{t, •) is again Gaussian and thus available analytically, with 
formulas similar to (3.5)-(3.6). We refer to [10] for details, including an example of applying 
kriging for computing the Delta of European options. Application of this approach to American- 
style contracts will be explored separately. 

The presented meta-modeling perspective modularizes DoE and regression, so that one can 
mix and match each element. Consequently, one can swap kriging with another strategy, or al¬ 
ternatively use a conventional LSMC with a space-filling design. In the opinion of the author, 
this is an important insight, highlighting the range of choices that can/need to be made within 
the broader RMC framework. In a similar vein, by adding a separate pre-averaging step that 
resembles a nested simulation or Monte Carlo forest scheme (see [5]), our proposal for batched 
designs highlights the distinction between smoothing (removing simulation noise) and interpola¬ 
tion (predicting at new input sites), which are typically lumped together in regression models. 
To this end, kriging meta-models admit a natural transition between modeling of deterministic 
(noise-free) and stochastic simulators and are well-suited for batched designs. 

Kriging can also be used to build a classification-like framework that swaps out cross-sectional 
regression to fit C with a direct approach to estimating 6. One solution is to convert the 
continuous-valued path payoffs Ht{x^) into labels: Zt{x) := l{Ht{x^) < h{t,x)) and build a 
statistical classification model pt for pt{x) := P(Zt(x) = l|x); for example a kriging probit model. 
Then <3t := {x : pt{x) > 0.5}. Modeling Z rather than Y might alleviate much of the ill-behavior 
in the observation noise e. These ideas are left for future work. 

By choice, in our experiments we used a generic DoE strategy with minimal fine-tuning. In fact, 
the DoE perspective suggests a number of further enhancements for implementing RMC. Recall 
that the original optimal stopping formulation is reduced to iteratively building T meta-models 
across the time steps. These meta-models are of course highly correlated, offering opportuni¬ 
ties for “warm starts” in the corresponding surrogates. The warm-start can be used both for 
constructing adaptive designs Vt (e.g. a sort of importance sampling to preferentially target the 
exercise boundary of dGt+i) and for constructing the meta-model (e.g. to better train the kriging 
hyper-parameters {s‘^,9)). Conversely, one can apply different approaches to the different time- 
steps, such as building experimental designs of varying size Nt, or shrinking the surrogate domain 
Xf ss t ^ 0. These ideas will be explored in a separate forthcoming article. 

As mentioned, ideally the experimental design T> is to be concentrated along the stopping 
boundary. While this boundary is a priori unknown, some partial information is frequently avail¬ 
able (e.g. for Bermudan Puts one has rules of thumb that the stopping boundary is moderately 
ITM). This could be combined with an importance sampling strategy to implement the path- 
generation approach of standard LSM while generating a more efficient design. See [13, 23, 14] 
for various aspects of importance sampling for optimal stopping. Another variant would be a 
“double importance sampling” procedure of first generating a non-adaptive (say density-based) 
design, and then adding (non-sequentially) design points/paths in the vicinity of estimated dGf 
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